Dear Sir:

1. This Declaration is being submitted to demonstrate that the claimed invention 
unexpectedly provides the ability to rapidly determine which of several activated transcription(al) 
factors is/are present in a cell or cell lysate with high sensitivity and specificity. 

2. I am an inventor on the above-identified patent application and am familiar with the 
specification and prosecution history. 

3. I have extensive experience in the field of the claimed invention as indicated in the 
attached Curriculum Vitae provided herewith as Exhibit A. 

4. The claimed invention provides a kit for accurate screening and/or quantification of 
activated transcription factor(s) present in a cell or cell lysates or for the screening and/or 
quantification of (a) compound(s) able to bind to said activated transcription factor(s) or inhibit 
the binding of said activated transcription factor(s) to a specific nucleotide sequence. In order to 
provide specificity, the claimed invention provides: a double-stranded DNA sequence comprising
a specific sequence which is specifically recognized by activated transcription factor(s) and a 
primary antibody or a specific hypervariable portion thereof, both being specific for the activated 
form of the transcription factor(s). In order to provide particular technical features related to the 
invention as explained below, the double-stranded DNA sequence is linked to a spacer which is a 
double-stranded nucleic acid part of at least 40 nucleotides and which is not present in the cell 
containing the activated transcription factor(s) to assay. 

5. The present invention was designed to overcome a problem in the prior art methods for 
the screening and/or quantification of activated transcription factor(s) present in a cell or cell 
lysates or for the screening and/or quantification of (a) compound(s) able to bind to said activated 
transcription factor(s) or inhibit the binding of said activated transcription factor(s) to a specific 
nucleotide sequence. The process of screening should not be overly time-consuming. The 
problem of identifying activated transcription factors is that one has to have: (A) high sensitivity 
to be able to detect low amounts of one or more activated transcription factors in a cell or cell 
extract, which represent only a small fraction of the total amount of proteins present in the cell or 
cell lysates, and (B) high specificity to be able to distinguish between activated and non-activated
transcription factors present in the sample, but also to only bind and detect the target transcription 
factor(s). Traditionally, transcription factor activity has been studied using either Electrophoretic 
Mobility Shift Assay (EMSA), immunoblotting or reporter gene assays. The problem in the prior 
art methods is that they are quite time-consuming and at best, provide only semi-quantitative 
results. In order to obtain good sensitivity of transcription factor detection, the EMSA method
proposes the use of short radioactive double-stranded oligonucleotide probes in the range of 20 
bp, containing the specific sequence of transcription factor binding. These radioactive probes are 
incubated in solution with cell extracts and if the transcription factors are present in the cell 
extract, they bind to their specific sequence. Samples are then resolved by native polyacrylamide 
gel electrophoresis followed by autoradiography. A retarded band, corresponding to transcription 
factor/probe complexes appears, in addition to the fast migrating band corresponding to the free 
probe. Under those basal conditions, however, only limited specificity is reached, as multiple 
transcription factors can bind the same specific sequence, i.e. if they belong to the same binding 
family (exemplified by the CREB family: Shaywitz and Greenberg (1999) Annu. Rev. Biochem.
68:821-861, see attached). In addition, many transcription factors can bind DNA without being
activated; their activation relies on structural modifications occurring after the binding (Shaywitz
and Greenberg (1999) Annu. Rev. Biochem. 68:821-861). To identify which transcription factor,
and under which form, actually formed the complex of the retarded band, EMSA uses supershift
experiments. An antibody which specifically recognizes the target transcription factor, and 
possibly its activated form, is added to the mix prior to electrophoresis. A slow-migrating, 
'supershifted' band appears if a complex between the transcription factor/DNA/antibody did 
form. Those skilled in the art acknowledge that supershifted bands are hardly observable and 
difficult to quantify. 

6. Therefore, EMSA is poorly specific, poorly sensitive, time-consuming and does not allow
the handling of a large number of samples. Thus the method is difficult to adapt to automation
and is not suited for screening. In addition, it is based on the use of 32P radioactive probes. If the
probe has a length corresponding to the exact sequence of the binding site for the transcription
factor (typically 4-8 nucleotides), it is not sufficient to obtain a sensitive detection, and additional
bases are necessary to allow the formation of stable complexes between the factor and the probe.
On the other hand, if the radioactive probe is too long, then the method has low specificity due to
the undesired cross-binding of transcription factors present in the sample to sequences adjacent to,
or even overlapping, the specific binding site (4-8 nucleotides).

7. To address the specificity issue, prior art assays have typically used short DNA sequences 
and to address the sensitivity issue, they have used radioactive probes which are incubated in 
solution with the transcription factors. However, these options necessarily invoke one of the
disadvantages raised above. 

8. We unexpectedly found that if a double-stranded DNA sequence is connected to the
surface of the solid support via a spacer containing a double-stranded nucleic acid part which is at
least 50 nucleotides in length, the detection of small amounts of activated transcription factors
gains in sensitivity without losing high specificity. We also found that using a spacer whose
nucleic acid part is not present in the cell containing the activated transcription factors to assay
allowed gaining in specificity without losing the sensitivity.

9. Keeping the detection sensitivity together with the specificity for detection of one or more
activated transcription factors was an unexpected result of the presently claimed invention. This
unexpected result made it possible to perform the assay in a multi-well format using a non-radioactive
detection method, i.e. using a second labeled antibody, directed against the primary antibody,
which is conjugated with an enzyme.

10. The ability of the claimed invention to measure the activity of transcription factors 
(simple and quantitative assay) and to rapidly yield results is extremely valuable in commercial 
applications. The kit of the present invention has been commercialized by Active Motif
(licensee) since 2000, and sales have increased steadily since that date. Two hundred 96-well
assay plates were sold in 2000 and 2750 plates were sold in 2006. The commercialized
product is referred to as TransAM™ kits.

11. In order to provide high sensitivity, the claimed invention utilizes a relatively long
double-stranded capture probe immobilized on a solid support, comprising a spacer which contains a
double-stranded DNA sequence of at least 50 nucleotides, preferably between 50 and 250 base 
pairs in length. The use of a long spacer allows increasing the sensitivity of the method on a solid 
support for which the steric hindrance is higher than a reaction performed in solution. Also, 
there is no need to label the capture probe like in the EMSA and the cell extract can be directly 
contacted with the insoluble solid support without any further treatment. However, transcription 
factors do not bind with the same specificity to long capture sequences as to short capture 
sequences. Indeed, increasing the length of the DNA sequence which contains a specific
transcription factor binding site also proportionally and statistically increases the number of 
binding sites for the same, but mainly for other transcription factors. Hence, a nucleotide 
sequence of 18 to 250 bp naturally involved in the regulation of the transcription of the gene, as 
taught by Peterson and proposed by the Examiner to contain the specific transcription factor 
binding site (4-8 bp) and a spacer (remaining bp), cannot be used to reach a specific transcription 
factor detection. 

12. To illustrate this point, the 46 bp sequence given as an example in Peterson et al.,
designed to bind NFkB (col. 13, lines 10-12), was analyzed using the TFSEARCH engine (on
WorldWideWeb at cbrc.jp/research/db/TFSEARCH.html; limitation to the vertebrate matrix).
Two NFkB specific sites were identified (corresponding respectively to base pairs 10-19 and 31-40).
However, binding sites for other factors were found to overlap these sites: NFkB site 1 is
overlapped by sites for MZF1, GATA-3 and GATA-1, while NFkB site 2 is overlapped by sites
for ADR1 and Ik-2.

13. Since Peterson et al. teach the possibility of using sequences of 250 bp, we tested the
feasibility of extending the 46 base pairs used in Peterson et al. to the length of 250 bp
naturally present in the gene containing the NFkB binding site. In a BLAST search performed
on the 46 bp sequence of the example of Peterson et al., a list of sequences with high homology
was obtained. The first sequence listed was the human sequence XM-941 266-2 (bp 7357-
7339), which was identical to base pairs 24-42 of the test sequence, containing the NFkB binding
site #2. We therefore considered a 250 bp fragment from this human sequence starting from
position 7357 (7357-7108) to cover the NFkB binding site and the adjacent sequence. Using the
TFSEARCH analysis tool, 27 high score binding sites for transcription factors were identified in
addition to the NFkB site (which now corresponds to bp 7341-7350), some overlapping or lying
very close to this NFkB site. A second NFkB binding site was even identified (bp 7240-7249).

14. Considering this sequence as a spacer would prevent the development of a specific NFkB 
assay, as transcription factors binding close to or within the NFkB binding site will interfere with the
assay. Such an assay would also not be reproducible, as different samples may contain different 
interfering transcription factors. Finally, quantification would also not be possible, as the two 
NFkB sites from the example above have different sequences, and hence different affinities for 
the factor. 

15. Using a long nucleotide sequence enables the specific nucleotide sequence to more 
efficiently bind low amounts of transcription factors. However, using long nucleotide
sequences also results in reduced specificity. The present invention overcomes this problem by
linking a short specific binding sequence to a spacer containing a double-stranded nucleotide part 
of at least 40 nucleotides in length and which is not present in the cell containing the activated 
transcription factors to assay. 

16. Exhibits 1-4 show experimental data that exemplify the advantages of the
claimed kit.

17. Exhibit 1 shows that binding of activated transcription factors to short specific binding
sequences which are linked to a very short nucleotide spacer of 6 bp is specific, but the obtained
signals are on average very low. In the experiment of Exhibit 1, five activated transcription
factors (NFkB, Elk-1, c-Myc, STAT1 and STAT3) were contacted with short double-stranded
capture probes of 30 bp comprising the specific binding site and a spacer of 6 bp. The protocol
provided in Example 3 of the present application was used to conduct the analysis, with several
modifications. The probes were as in Example 1, except that the CMV spacer was replaced by a
synthetic 5' aminated spacer. Spotting was performed directly on activated glass slides without
streptavidin treatment, and the probe concentration was 2000 nM. The assay was conducted
according to the protocol described in the TF Chip MAPK kit, with fluorescence detection 
(Eppendorf, Germany). Exhibit 1 compares the results obtained with the five transcription 
factors. As depicted in Exhibit 1, only NFkB is detected with good sensitivity, the other factors
showing either a low signal (c-Myc, STAT1 and STAT3) or no signal at all (Elk-1). Results can 
be compared with the Exhibit 3 for longer spacers. The high variability in signal detection using 
6 bp spacers is also incompatible with the simultaneous analysis of more than one factor. Thus, 
use of a short double-stranded DNA capture probe provides specificity but not sensitivity for the 
majority of the tested factors (4 out of 5). 

18. Exhibit 2 shows that binding of an activated transcription factor (HNF3) to a short
specific binding sequence which is linked to a long double-stranded DNA spacer of 100 bp may
be specific or not depending on the sequence of the spacer. The short specific binding sequence
of HNF3 is linked to the support via a spacer of 100 bp (spacer 1). Different spacer sequences of
100 bp (spacers 1 to 6) are also present on the array as such (not linked to the HNF3 specific binding
sequence). The protocol described in the TF Chip Stem Cell kit (Eppendorf, Germany) was
followed in this experiment. As depicted in Exhibit 2, the HNF3 transcription factor binds
specifically to its specific sequence linked to spacer 1, but also to spacer sequence 4 alone.
Therefore, the selection of the spacer used in the assay is important to ensure sensitivity but also
specificity. Such a spacer must be designed so that it does not bind any transcription factor
possibly interfering with the assay. Such a spacer sequence is preferably a synthetic sequence not
present in the cell assayed for the presence of activated transcription factor(s).

19. Exhibit 3 shows that binding of activated transcription factors to short specific binding
sequences which are linked to a spacer comprising a double-stranded DNA sequence of at least
40 nucleotides, which sequence is not present in the questioned cell, results in an assay which is
highly specific and sensitive. The experiment was conducted as in Exhibit 1 but in addition to
the spacer of 6 bp, synthetic spacers of 20, 50 and 100 bp were tested and the resulting signals
were quantified. As depicted in Exhibit 3, a high signal was obtained for the five tested
transcription factors when the specific binding sequence is linked to spacers of 50
and 100 bp. We also found that the signals measured with spacers below 50 bp may not increase
linearly with the spacer size (see STAT3). This is important in the context of a microarray 
because the binding conditions for all of the factors to be assessed on the array are uniform, while 
the optimal binding conditions for each factor are different. Enhancing the signal levels using 
spacers of the lengths recited in the claims offers the possibility to evaluate the binding of 
multiple factors under uniform conditions. 

20. Exhibit 4 shows that even when a plurality of transcription factors is present in a sample, 
the use of a short specific binding sequence linked to a spacer comprising a double-stranded 
DNA sequence of at least 50 nucleotides which sequence is not present in the questioned cell, 
allows obtaining specific signals with high values (i.e. the method is sensitive and specific). Each 
capture probe of the microarray contains double-stranded DNA comprising a specific binding site 
for the TF and a common spacer of 100 bp. Exhibit 4 shows the quantification of signals 
resulting from activated TFs binding to a micro-array (TF Chip MAPK, Eppendorf, Germany). 
Extracts were obtained from HeLa cells stimulated with PMA either for 10 min or for 1 hour, and 
the protocol was as described in the TF Chip MAPK kit instruction manual. In Exhibit 4, a signal 
increase was obtained for AP1 (c-Jun), MEF2 and p53 after 1 hour PMA stimulation of HeLa
cells compared to the 10 min stimulation condition. There was activation at both stimulation
times for ATF2, cMyc and ELK1. NFATc1 and STAT1 were not activated at any stimulation
time. The assay is quantitative and signal changes between different stimulation times (observed
for AP1 (c-Jun), MEF2 and p53) are extremely valuable to understand the
activation profile of a cell.

21. In conclusion, the kit provided according to the requirements of the present invention is 
both sensitive and specific even when a plurality of transcription factors is present in the sample. 

22. I declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further that these 
statements were made with the knowledge that willful, false statements and the like so made are 
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States 
Code and that such willful false statements may jeopardize the validity of the application or 
patent issuing therefrom.

Exhibit 1: Quantification of signals obtained for the binding of five activated transcription factors
(TF) (NFkB, Elk-1, c-Myc, STAT1, STAT3) on their respective capture molecules on a
microarray. Capture molecules corresponding to each TF are made of double-stranded DNA and
comprise a specific binding site and a common spacer of 6 bp. Signals for each TF were
measured for control and stimulated cells in each factor's optimal assay conditions: NFkB: WI38
+ interleukin-1 (IL-1); Elk-1: HeLa + phorbol 12-myristate 13-acetate (PMA); c-Myc: NIH3T3 +
PMA; STAT1: COS7 + interferon-γ (IFNγ); STAT3: HepG2 + interleukin-6 (IL-6). Signals were
obtained with Cy3-labeled secondary antibodies and fluorescent scanning using a ScanArray
Express microarray scanner from Packard Bioscience and a laser power of 100. Scans were
performed with a gain = 100, except for NFkB, where a gain = 80 was used. Y axis is in relative
fluorescence units. X axis represents the transcription factor tested.
Legend: non-stimulated cells and stimulated cells are shown as differently shaded bars.

Exhibit 2: Fluorescent detection of the binding of the activated hepatocyte nuclear factor (HNF)3
transcription factor through its binding to different capture probes on a microarray.
Capture probes were spotted on the array in triplicate:

- Short sequences specifically binding different transcription factors (26 bp) linked to spacer 1
(100 bp). The name of the transcription factor is given in blue.
- Spacers alone (1 to 6; 100 bp) with different sequences (in red).
- Ctrl-: spotting buffer.
- Ctrl+: Cy3-labeled spacer 1.

The array was contacted with 30 µg nuclear extract from HepG2 cells. The primary
antibody used was a goat anti-HNF3β polyclonal IgG (Santa Cruz Biotechnology). The assay
was performed according to the procedure described in the instruction manual of the TF Chip
Stem Cell kit (Eppendorf, Germany). Signals were obtained with Cy3-labeled secondary
antibodies and fluorescence scanning using a ScanArray Express microarray scanner from
Packard Bioscience and a laser power of 100 and a gain of 80.

Exhibit 3: Quantification of signals obtained for the binding of five activated transcription factors
(NFkB, Elk-1, c-Myc, STAT1, STAT3) on their respective capture molecules on a microarray as
provided in Exhibit 1. Capture molecules corresponding to each TF are made of double-stranded
DNA and comprise a specific binding site and a common spacer of increasing length of 6, 20, 50
or 100 bp, which is a nucleotide sequence not present in the tested cells. Y axis is in relative
fluorescence units. X axis represents the spacer size (bp).
Legend: non-stimulated cells and stimulated cells are shown as differently shaded bars.

Appl. No.: 10/821,568
Filed: April 8, 2004

[Figure for Exhibit 4. Title: TF Chip MAPK: comparison of the TF activation profiles obtained for HeLa
cells stimulated with PMA for 10 min and 1 h. X axis: transcription factor (AP1 c-Jun, ATF2, cMyc,
ELK1, MEF2, NFATc1, STAT1). Y axis scale: 0 to 70000. Bar series: 10 min and 1 h.]

Exhibit 4: Quantification of signals obtained for the binding of activated transcription factors
(TFs) in nuclear extracts of HeLa cells after PMA stimulation at two different times using the TF
Chip MAPK kit (Eppendorf, Germany). Spots corresponding to each TF contain double-stranded
DNA comprising a specific binding site for the TF and a common spacer of 100 bp. Signals for
each TF were measured for 10 min stimulation with PMA (A) and for 1 hour stimulation with
PMA (B). Signals were obtained with Cy3-labeled secondary antibodies and fluorescence
scanning using a ScanArray Express microarray scanner from Packard Bioscience and a laser
power of 100. Scans were performed with a gain = 80.

Y axis represents the fluorescence signals as generated by the TF Chip MAPK kit software. X
axis represents the transcription factors that can be detected using the TF Chip MAPK kit.
Legend: 10 min stimulation and 1 hour stimulation are shown as differently shaded bars.

Abstract — One of the goals of systems biology is the
identification of regulatory mechanisms that govern an 
organism's response to external stimuli. Transcription factors 
have been hypothesized as a major contributor to an 
organism's response to various outside stimuli, and a great 
deal of work has been done to predict the set of transcription 
factors which regulate a given gene. Most of the current 
methods seek to identify possible binding sites from genomic 
sequence. Initial attempts at predicting transcription factors 
from genomic sequences suffered from the problem of false 
positives. Making the problem more difficult, it has also been 
shown that while predicted binding sites might be false 
positives, they can be shown to bind to their corresponding 
sequences in vitro. One method for rectifying this is through 
the use of phylogenetic analysis in which only regions which 
show high evolutionary conservation are analyzed. However 
such an approach may be too stringent because of the level of 
degeneracy shown in transcription factor binding site 
position weight matrices. Due to the degeneracy, there may 
be only a few bases that need to be conserved across species. 
Therefore, while a sequence may not show a high level of 
evolutionary conservation, these sequences may still show 
high affinity for the same transcription factor. In predicting 
transcription factor binding we explore the notion that
"Co-expression implies co-regulation" [Allocco et al. BMC
Bioinformatics 5:18, 2004]. With multiple genes requiring 
similar transcription factors binding sites, there exists a basis 
for eliminating false positives. This method allows for the 
selection of transcription factors binding sites that are active 
under a given experimental paradigm, thereby allowing us to 
indirectly incorporate the effects of chromosome and recog- 
nition site presentation upon transcription factor binding 
prediction. Rather than having to rationalize that a few 
transcription factors binding sites are over-represented in a 
cluster of genes, one can show that a few transcription factors 
are active in the cluster of genes that have been grouped 
together. Although the method focuses on predicting exper- 
iment-specific transcription factor binding sites, it is possible 
that if such a methodology were used in an iterative process 
where different experiments were analyzed, one could obtain 
a comprehensive set of transcription factor binding sites
which regulate the various dynamic responses shown by
biological systems under a variety of conditions, hence
building a more comprehensive model of transcriptional
regulation.

Keywords — Corticosteroids, Gene expression, Transcription 
factor binding site, Phylogenetics. 

INTRODUCTION 

In the wake of the completion of various genome 
projects, it was remarked that given the length of the 
genome, it was surprising that so much of it was not 
devoted to coding for protein products. 31 However, 
taking a more holistic approach in the analysis, one 
realizes that given the ability of complex systems to 
respond to a wide range of external stimuli, the ratio of 
nucleotides devoted to the non-coding region vs. the 
coding region should not be surprising. Treating the 
DNA sequence as the master control program for an 
organism, it would follow that the majority of the se- 
quence should be devoted to the dynamic aspect of the 
response or program logic, rather than mere storage 
for protein sequences. Researchers have begun to view 
the non-coding "junk" DNA as equal in importance to 
the coding regions due to their role in the regulation in 
mRNA levels and hence protein production. Without 
precise control of protein production via the non- 
coding regions, an organism would be nothing more 
than a static bag of different molecules and be unable 
to respond to changes in the environment. 27 

Transcription factors work by binding to specific 
sequences upstream of the coding region and either 
increase or decrease the affinity of RNA polymerase 
for the sequence, thereby altering the rate of mRNA 
production. 1 The binding of these transcription factors 
has been determined to be sequence specific through 
various binding experiments. 33 Previous work by 
Wasserman et al. has shown that this fact can be
used to predict the existence of regulatory motifs
within the DNA sequence. However, given the relatively
short lengths of these recognition sites, ranging
from 6 to 14 bases, 26,39 as well as the degeneracy possible
with each given transcription factor binding site,
the probability of a random hit is quite high. More
problematic in this evaluation is that the transcription 
factors can be shown to bind in vitro even if they show 
no in vivo activity. This suggests that there exist other 
conformational factors that regulate whether a given 
sequence in the DNA is available for binding. 
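
The scale of the random-hit problem can be illustrated with a short calculation. The sketch below is not part of the method described here; it assumes independent, equiprobable bases and a hypothetical description of a motif as the number of tolerated bases at each position, and it counts both strands separately (which double-counts palindromic motifs).

```python
from math import prod


def expected_chance_hits(seq_len, allowed_per_position, both_strands=True):
    """Expected number of chance matches to a degenerate motif in random DNA.

    allowed_per_position lists, for each motif position, how many of the four
    bases are tolerated (1 = fully conserved, 4 = fully degenerate).
    Assumption: bases are independent and equiprobable (illustrative only).
    """
    k = len(allowed_per_position)
    if seq_len < k:
        return 0.0
    # Probability that one random window matches the motif at every position.
    p_match = prod(n / 4 for n in allowed_per_position)
    windows = seq_len - k + 1
    strands = 2 if both_strands else 1
    return windows * strands * p_match


# A fully conserved 6-base site versus the same site with two degenerate positions,
# both searched in a hypothetical 1000 bp promoter region.
strict = expected_chance_hits(1000, [1, 1, 1, 1, 1, 1])
loose = expected_chance_hits(1000, [1, 1, 2, 4, 1, 1])
print(strict, loose)
```

Under these assumptions even a fully conserved 6-base site is expected to occur by chance roughly once in every two kilobases of double-stranded sequence, and each degenerate position multiplies that expectation.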

Most researchers have tackled the problem of false 
positives via the method of phylogenetic footprinting. 2,5,7-11,16,21,30
The core assumption in phylogenetic
footprinting is that significant control mechanisms in 
an organism are evolutionarily conserved. Therefore, 
by utilizing the genomes of multiple related organisms, 
one should be able to identify conserved regulatory 
regions within the DNA. The primary benefit of this 
technique is that it limits the search space for which 
possible transcription factors binding sites can be 
found. This technique is exemplified by tools such as 
CONSITE, 36 and FOOTER, 11 which look for se- 
quence homologies between two different species. 
CONSITE represents the basic phylogenetic analysis 
technique presented by Wasserman et al. 39 in which 
only sequences which show high homology between 
two species such as Rat and Human would be analyzed 
via Position Weight Matrices (PWM) in order to 
determine which transcription factors binding sites are 
present. The primary difference between these and 
other tools concerns the different ways in which 
homologous sequences are identified. 
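
As a rough illustration of what a PWM analysis involves, the following sketch scores every window of a sequence against a log-odds matrix built from base counts. It is a generic textbook formulation, not the scoring scheme of CONSITE or FOOTER; the pseudocount, the uniform background, the threshold, and the toy motif are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    k = len(matrix)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy 4-base motif (hypothetical): ten observed sites, all reading TGAC.
toy_counts = [{"T": 10}, {"G": 10}, {"A": 10}, {"C": 10}]
toy_matrix = log_odds_matrix(toy_counts)
print(scan("AATGACTT", toy_matrix, threshold=5.0))
```

Lowering the threshold admits windows with mismatches, which is one way the degeneracy discussed in the next paragraph translates into additional predicted sites.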

An important point of concern with phylogenetic 
analysis lies in the relative degeneracy of transcription 
factor binding matrices. 19 In many cases such as the 
transcription factor RE-1, a regulator of neuronal 
development, it was found that the transcription factor 
binding site had regions of high degeneracy, specifi- 
cally that only 12 out of the 21 positions are highly 
conserved. 43 Due to this fact, it is conceivable that a 
transcription factor can bind across multiple species 
with significantly different recognition sequences. 28 
Therefore, if sequence conservation is the primary 
driving force for the phylogenetic analysis of the pro- 
moter region, many important regions could be dis- 
carded. The consequence of this is that in many cases, 
the transcription factor binding sites predicted from 
the homologous sequences will be unable to satisfy the 
notion that co-expression implies co-regulation since it 
will be hard to detect a consistent set of transcription 
factor binding sites. 

In this paper, we will show that provided that there 
is a high level of correlation within a set of clustered
genes, there is sufficient information to extract a small
set of transcription factor binding sites that can be 
hypothesized to co-regulate the genes in question. This 
is similar to the use of transcription factor enrichment 
to rationalize clustering results, 29 or the prediction of 
regulatory models from a series of experiments. 37 
However, while studies have shown the predilection of 
transcription factors within groups of co-expressed 
genes, we will show that if the co-expressed genes have 
a correlation coefficient above a certain threshold, then 
a great majority of genes (>90%) will contain a small 
subset of transcription factor binding sites in common. 
This will eliminate many of the false positives and yield
a set of experimentally consistent transcription factors.
Additionally, we will show that promoter regions that
have been preprocessed via phylogenetic footprinting
do not show an increased probability of containing
transcription factor binding sites over that of the 
baseline sequence, suggesting that either phylogenetic 
footprinting is unable to preferentially select for reg- 
ulatory regions, or that there are non-evolutionarily 
conserved regulatory sites in the sequence. 
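
The selection idea outlined above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the use of the minimum pairwise Pearson correlation as the cluster criterion, the 0.9 correlation threshold, and the example gene and factor names are hypothetical, while the 90% coverage figure follows the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shared_sites(profiles, predicted_sites, min_corr=0.9, min_fraction=0.9):
    """Binding sites predicted in at least min_fraction of a tightly correlated cluster.

    profiles: gene -> expression time series.
    predicted_sites: gene -> set of transcription factors with a predicted site.
    Returns an empty set when any gene pair falls below min_corr.
    """
    genes = list(profiles)
    if any(pearson(profiles[a], profiles[b]) < min_corr
           for a, b in combinations(genes, 2)):
        return set()
    counts = {}
    for gene in genes:
        for tf in predicted_sites.get(gene, set()):
            counts[tf] = counts.get(tf, 0) + 1
    return {tf for tf, c in counts.items() if c / len(genes) >= min_fraction}


# Hypothetical three-gene cluster; only "GR" is predicted in every promoter.
profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.1], "g3": [0, 1, 2, 3]}
sites = {"g1": {"GR", "AP1"}, "g2": {"GR", "NFkB"}, "g3": {"GR"}}
print(shared_sites(profiles, sites))
```

Sites that appear in only one or two promoters of the cluster are dropped, which is how co-expression is used here as a filter against false positives.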

METHODS 

Data Collection and Gene Expression Measurement 

The microarray data was obtained from an experi- 
ment that was conducted to examine the behavior of a 
bolus injection of corticosteroids upon temporal gene 
expression profile of living cells. This dataset was 
specifically chosen due to the a priori knowledge that 
corticosteroids have powerful transcriptionally medi- 
ated effects upon the rat experimental model. The data 
collection and preliminary analysis were previously 
previously presented. 4 The data is available in the GEO database
under the accession number GDS253.

Identification and Classification of Relevant Gene 
Expression Profiles 

After the data has been obtained, it is important for 
the expression profiles of relevant genes to be ex- 
tracted. This step essentially seeks to extract genes 
whose expression profiles are actively being mediated 
by transcription factors as part of a transcriptional 
regulation pathway. By doing this, it ensures that the 
genes that were selected and grouped are part of the 
same transcriptional response mechanism and there- 
fore should show clear trends when conducting tran- 
scription factor analysis. 

Preliminary examination of the data led to the
observation that different clustering algorithms yielded 
inconsistent results which were different in the number 
of optimal clusters, or the genes which were grouped 



Context Specific Transcription Factor Prediction 



1055 



together. 34 Further analysis suggested that the data 
itself was antagonistic to data clustering due primarily 
to the fact that no clear boundaries existed. Common 
data selection techniques such as various filters build 
around data quality checks like Asymetrix's Absent, 
Present or Marginal flags or selecting genes that 
showed expression levels which changed by greater 
than 2x up or down yielded a subset of data which still 
was not clearly partitionable. SLINGSHOTS was an 
attempt at combining both clustering and selection in 
order to obtain a subset of genes in which boundaries 
could be seen. 

We recently proposed a novel algorithm for the 
identification and classification of relevant gene 
expression profiles called SLINGSHOTS (SeLection of 
INformative Genes via Symbolic Hashing Of Time 
Series). 42 The key motivating argument for this meth- 
od is the realization that in the presence of noise and 
uncertainties associated with measuring mRNA 
abundance, looking for exact correlations or distance 
metrics between gene pairs may not necessarily yield 
the most informative interpretation. On the contrary, 
robust, coherent and dominating qualitative features 
and similarities could be a more informative proxy for 
the information content of the expression experiment. 
With our approach, the raw data is transformed into 
sequences of events, or symbols, and these are further 
analyzed for consistencies. Our algorithm is based on 
the assumption that genes that are relevant to the 
underlying dynamics of the system have two essential 
characteristics. The first is that they are part of a 
concerted mechanism and should possess expression 
profiles which are temporally consistent with the 
expression profiles of other genes involved in related 
molecular mechanisms. The second is that the dynamics 
of the set of informative genes ought to show 
significant deviations in their aggregate activity 
from their initial baseline activity distribution. 
Therefore, our algorithm performs a fine-grained clustering 
which results in hundreds of clusters. We then evaluate 
the ability of a subset of these individual clusters to 
satisfy these two constraints, thereby linking the 
selection process with the clustering result. The 
advantage of this technique is that we are able to 
perform data selection with clustering quality in mind 
and parse the contribution of each cluster to the 
overall dynamics of the system. 

SLINGSHOTS uses the notion that genes which are part 
of a large, highly correlated set of genes are more 
likely to be significant, based on the assumption that an 
organism responds to outside challenges to homeostasis 
through the utilization of a set of genes which are 
highly controlled in both their expression levels and 
temporal evolution. It has already been shown that 
genes which show a high degree of correlation in their 
expression profiles tend to be involved in related 
functions. 3 There is an additional qualifier: given 
significant perturbations to the experimental system, 
a large number of genes with coordinated responses 
need to be brought online to deal with the challenge 
to homeostasis. 

SLINGSHOTS deterministically clusters expression 
profiles into a large set of putative clusters via a 
hashing process. Hashing is utilized to reduce an 
expression profile to a single integer. Genes with the 
same integer have very similar expression profiles. 
The hashing methodology used is the one proposed by 
Lin et al. 24 What hashing accomplishes for our purposes 
is the grouping of expression profiles into a large 
number of putative clusters, all with a similar range of 
correlation coefficients. The procedure for going from 
an expression profile to a hash value is given in 
Appendix 1. 
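
As an illustration of how a time series can be reduced to one integer, the following is a minimal sketch in the spirit of symbolic discretization. It is not the procedure of Appendix 1: the function name `symbolic_hash`, the choice of three symbols, and the use of equiprobable normal breakpoints are all assumptions made for this example, and the dimensionality reduction step that normally precedes discretization is omitted.

```python
from statistics import NormalDist, mean, pstdev


def symbolic_hash(profile, n_symbols=3):
    """Hash an expression time series to one integer (hypothetical helper)."""
    mu, sigma = mean(profile), pstdev(profile)
    # z-normalize so profiles are compared by shape rather than magnitude
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in profile]
    # breakpoints that split a standard normal into equiprobable bins
    cuts = [NormalDist().inv_cdf(k / n_symbols) for k in range(1, n_symbols)]
    # each time point becomes a symbol in 0 .. n_symbols - 1
    symbols = [sum(v > c for c in cuts) for v in z]
    # read the symbol string as a base-n_symbols number
    value = 0
    for s in symbols:
        value = value * n_symbols + s
    return value


# Two profiles with the same shape but different scale share a hash value.
same_shape = symbolic_hash([1, 2, 3, 4]) == symbolic_hash([10, 20, 30, 40])
```

Under these assumptions, profiles that rise and fall at the same time points collapse to the same integer regardless of amplitude, which is the property the clustering step relies on.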

After the genes have been put into their respective 
clusters, the next task is to identify which of these gene 
clusters are actively participating in the experimental 
response. Given that these experiments attempt to 
perturb the homeostatic balance forcing the organism 
into a different transcriptional state, the algorithm se- 
lected clusters that when combined yield a significant 
deviation in the distribution of expression level values 
from that of baseline. Therefore, one should be looking 
for genes which alter the distribution of up-regulated 
and down-regulated expression levels during the course 
of the experiment, thereby pointing to their active role 
in changing the transcriptional state of the organism. 
Given that there are hundreds of clusters generated via 
the hashing step, a greedy selection algorithm was 
implemented in which the clusters are added in the order 
of their population. The overall algorithm is given in 
Appendix 2. The results of SLINGSHOTS are given in 
Fig. 1, indicating the 12 clusters that were identified as 
informative; they will be further discussed in the Results 
section. 

Identification of Possible Transcription Factor Binding 
Sites 

The identification of possible transcription factor 
binding sites is broken down into two steps: (i) the 
identification of the promoter region, and (ii) the 
identification of putative transcription factor binding 
sites. CORG 14 was used for the identification of promoter 
regions as well as the identification of relevant 
transcription factor binding sites. CORG was selected 
primarily for its ability to extract the 5' upstream 
region up to the next gene rather than to a set number of 
upstream base pairs. This was important to us due to 
the nebulous concept of how far upstream a promoter 
region lies. It has been shown that the GRE 
(Glucocorticoid Response Element) could be found 
thousands of base pairs upstream of the start codon. 5 
Other tools, such as TRED, 44 on the other hand require as a 
parameter the number of upstream base pairs to consider. 
Additionally, by using CORG, one is able to utilize its 
built-in facilities to extract both homologous sequences 
and transcription factor binding sites. 

FIGURE 1. A sample cluster obtained from SLINGSHOTS. All of the clusters show a reasonable correlation to the average 
normalized profile. 

One complication which needed to be addressed was 
the fact that CORG returns homologous sequences 
between two species and is unable to return just the 
entire promoter region for a single species. To 
compensate for this drawback, the evaluation was 
conducted in the following manner. To evaluate the 
difference between phylogenetic footprinting and our 
proposed approach of looking at the promoter regions 
of a set of clustered genes in aggregate, a CORG search 
was conducted upon human/rat and mouse/rat. The 
human/rat case is the baseline example of phylogenetic 
footprinting, in which ideally there will be a small set of 
regulators which give rise to the similar responses to 
corticosteroids in humans and rats. The mouse/rat case 
was used as a proxy for the context specific case, in 
which the analysis is performed only on the rat promoter 
region in order to determine the transcription factors 
which are present in all of the genes in the cluster. The 
rationale for running this case is that the rat/mouse 
promoter regions have about an 85% conservation 
rate among homologous sequences, 41 and are therefore 
genetically very similar. Given this high level of 
conservation between the two species, as well as the fact 
that CORG keeps sequences that show a homology of 
greater than 70% over 100 base pairs, 13 the mouse/rat 
result provides a reasonable facsimile of the rat promoter 
region. 

Verification of the results was initially going to be 
conducted by comparing our selected transcription 
factors with known transcriptional regulators via 
RnPD. 41 However, an initial evaluation of the selected 
and clustered genes revealed that there was insufficient 
data on known binding sites in order to make any sort 
of meaningful assessment. 



Data Analysis 

The primary metric to be analyzed is the number of 
times a transcription factor binding site is found in 
the 5' regions of the genes that make up a highly 
correlated cluster. This is necessary in order to 
determine whether or not there are any transcription 
factor binding sites which are present in a sufficient 
percentage of genes to be reasonable candidates for 
the co-regulation of the genes within the cluster. 
Secondly, once the metric is quantified, it will be 
possible to ascertain the overall distribution of 
transcription factors throughout the cluster of genes, 
allowing one to determine whether a highly conserved 
transcription factor is present due to a statistically 
significant event, or whether it is highly conserved due 
to chance. 

FIGURE 2. The occurrence rate of transcription factor binding sites when random genes are grouped together. (Top) The 
exponential distribution. (Bottom) A log-linearized version of the plot. (Note: the tail end agrees well with the overall exponential 
distribution). 
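
The metric described above can be sketched as a simple tally. In this minimal example the input layout (a mapping from gene to the transcription factors predicted in its 5' region), the function names, and the toy gene identifiers are assumptions for illustration only; they do not reflect the output format of CORG.

```python
from collections import Counter


def conservation_rates(cluster_sites):
    """Fraction of genes in a cluster whose 5' region contains each factor."""
    n_genes = len(cluster_sites)
    counts = Counter()
    for sites in cluster_sites.values():
        counts.update(set(sites))  # count a factor at most once per gene
    return {tf: c / n_genes for tf, c in counts.items()}


def occupancy_histogram(cluster_sites):
    """Number of factors found in exactly k genes, keyed by k."""
    counts = Counter()
    for sites in cluster_sites.values():
        counts.update(set(sites))
    return dict(Counter(counts.values()))


# Toy cluster with hypothetical gene identifiers.
example = {
    "gene_a": ["STAT5", "CDX", "CDX"],
    "gene_b": ["STAT5"],
    "gene_c": ["STAT5", "CDX"],
    "gene_d": ["STAT5", "USF1"],
}
rates = conservation_rates(example)
histogram = occupancy_histogram(example)
```

The histogram is the kind of quantity plotted per cluster in the distribution figures: how many factors are present in a given number of genes.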



The process of finding a hit for a specific sequence in 
the promoter region can be modeled by an exponential 
distribution whose PDF is given in Eq. (1). In Fig. 2, a 
random set of genes was selected, and a distribution 
is given that relates the number of transcription factors 
to the number of genes that a given transcription factor 
is predicted to bind to. From this distribution, it 
appears that the initial assumption, that one can model 
the transcription factor occurrence rate on a cluster of 
genes as an exponential distribution, is reasonable. This 
also functions as a negative control. If the genes were 
randomly selected, then the distribution of transcription 
factors per cluster ought to match the exponential 
distribution. If there are deviations from this 
exponential graph near the tail end, representing 
conservation of a significant number of transcription 
factors at levels higher than would be expected, then it 
would suggest the presence of a significant co-regulation 
mechanism. 

$$\mathrm{pdf}(x) = \lambda e^{-\lambda x} \qquad (1)$$

To obtain the parameters for the PDF, the mean 
number of times a transcription factor binding site is 
present amongst the genes in a cluster, as well as the 
standard deviation of this distribution, is calculated. 
Given the slight discrepancy between the two values, the 
average of the mean and the standard deviation is used 
as the parameter with which to model the distributions. 
The fits of the distributions for the 12 clusters are 
shown in red in Figs. 3 and 4. The exponential 
distribution will then allow us to obtain the probability 
that a single transcription factor will be conserved over 
x% of the time. This probability will be used below to 
calculate the expected number of highly conserved 
transcription factors. 

FIGURE 3. Distribution for the rat/mouse case that shows the number of transcription factors that are found in a given number of 
genes for the different clusters. Upon initial observation, we find that the distribution can be modeled as an exponential 
distribution. The red curve was obtained via parameter estimation from the distribution mean and standard deviations. 

FIGURE 4. Distribution for the rat/human case that shows the number of transcription factors that are found in a given number of 
genes for the different clusters. Upon initial observation, we find that the distribution can be modeled as an exponential 
distribution. The red curve was obtained via parameter estimation from the distribution mean and standard deviations. 
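
The parameter estimate and the tail probability described above can be sketched as follows. This is a minimal illustration, assuming the averaged value is used as the scale (mean) of the exponential distribution and that "conserved over x% of the time" means present in at least that fraction of the genes in a cluster. The function names are hypothetical; the mean and standard deviation are the mouse/rat values for cluster 1 in Table 1.

```python
import math


def fit_scale(mean_count, sd_count):
    """Average the mean and standard deviation into one scale parameter."""
    return 0.5 * (mean_count + sd_count)


def tail_probability(scale, n_genes, rate):
    """P(X >= rate * n_genes) for an exponential with the given scale."""
    return math.exp(-rate * n_genes / scale)


# Cluster 1, mouse/rat (Table 1): mean 7.26, standard deviation 6.44.
scale = fit_scale(7.26, 6.44)
# n_genes is not reported per cluster here, so this value is a placeholder.
p_conserved = tail_probability(scale, n_genes=23, rate=0.95)
```

Because the tail falls off exponentially with the number of genes, the probability of a chance hit at a fixed conservation rate depends on the ratio of cluster size to scale, which is why a linear relationship between the two keeps that probability roughly constant across clusters.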

After the exponential distribution has been fitted, it 
becomes possible to calculate the expected number of 
transcription factors that ought to be highly 
conserved given the exponential distribution. If one 
has been able to filter out false positives, then one 
should find that the number of transcription factors 
that are actually conserved is less than the 
expected value. The statistical significance of the 
number of transcription factors which are actually 
conserved will be calculated via the binomial 
distribution, Eq. (2), under the assumption that the 
presence of a given transcription factor above any given 
conservation rate can be modeled as a random process. 

$$P(n \mid N) = \binom{N}{n}\, p^{n} (1-p)^{N-n} \qquad (2)$$



RESULTS 

The classification and selection step yielded 12 clusters 
with a total of 529 probe sets, of which the clustering 
results are given in Fig. 1. The 529 probe sets 
correspond to 454 genes, of which 339 genes had entries in 
the CORG database. The most important property of 
these clusters is the high level of correlation between all 
of the genes in the cluster. Data have been presented 
suggesting that for genes to have a greater than baseline 
chance of having transcription factors in common, the 
correlation coefficient should be greater than 0.75. 3 
Our clusters show an average correlation coefficient of 
0.85, comfortably over the limit. In the transcription 
factor dataset on which these conclusions were based, 22 
only 37% of the genes actually showed significant 
experimental binding to transcription factors. So while 
a 0.85 correlation in the signal suggests only a 50% 
commonality between two genes, we believe that 
biologically the percentage in mammalian systems is 
considerably higher, due to the relatively sparse nature 
of the isolated yeast transcription factors. Additionally, 
we believe that if a transcription factor can be shown to 
be over-represented despite the less than perfect 
correlation, it has a greater chance of being significant 
compared to those which are not over-represented within 
a cluster. 
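
The correlation criterion can be checked with a short routine such as the one below. This is a minimal sketch: it assumes the reported figure is the mean Pearson correlation over all gene pairs in a cluster, which the text does not state explicitly, and the toy profiles are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def mean_pairwise_correlation(profiles):
    """Average correlation over every pair of profiles in a cluster."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy cluster of three expression profiles (illustrative values only).
toy_cluster = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 5]]
passes_threshold = mean_pairwise_correlation(toy_cluster) > 0.75
```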

Figures 3 and 4 show the distribution of transcription 
factor binding sites that are conserved over a 
certain number of genes in a cluster. The values on the 
x-axis are dependent on the overall number of genes in 
a cluster, and the values on the y-axis denote the 
number of transcription factor binding sites that were 
present in a given number of genes. The results of this 
plot seem to suggest that the distribution of 
transcription factor binding sites amongst the genes in a 
given cluster can be modeled via an exponential 
distribution. Given that the exponential distribution 
given in Eq. (1) is primarily defined by the parameter λ, 
which determines both the mean and the standard deviation 
of an exponential distribution (the two are equal), the 
means and standard deviations for the number of times a 
transcription factor binding site was present in a gene 
of that cluster are shown in Table 1. It is notable that 
these values are reasonably close, reinforcing our 
assumption that the exponential distribution is a good 
fit for the data. To obtain the exponential fits given in 
Figs. 3 and 4, the means and the standard deviations were 
averaged for each cluster to obtain a single consistent 
value halfway between the two. Looking at the parameters, 
there was a direct correlation between the parameters 
themselves and the number of genes in a cluster. This 
linear relationship is illustrated in Fig. 5, where the 
parameters are plotted against the number of genes in a 
cluster. This fact will be revisited during the 
discussion. 

TABLE 1. THE STATISTICS WHICH DESCRIBE THE DISTRIBUTION OF TRANSCRIPTION FACTORS PER CLUSTER. 

Transcription factor occurrence statistic 

           Mouse/Rat                      Human/Rat 
Cluster    Mean number      Standard      Mean number      Standard 
           of occurrences   deviation     of occurrences   deviation 
1          7.26             6.44          3.44             3.80 
2          10.95            10.80         5.59             5.74 
3          6.78             6.02          3.90             3.15 
4          9.65             9.57          5.68             5.89 
5          5.51             4.60          3.21             2.97 
6          11.01            11.68         5.66             6.69 
7          5.81             4.48          3.51             2.99 
8          7.54             6.69          3.41             3.44 
9          8.41             6.58          4.33             4.07 
10         5.96             5.00          3.11             3.15 
11         7.56             6.64          3.42             3.45 
12         10.40            10.10         4.63             5.78 

The similarity between the means and standard deviations suggests 
that the distribution can be modeled via an exponential distribution. 
The results of this chart suggest that the primary driving force in the 
number of times a transcription factor is found within a gene is the 
length of the promoter region being analyzed. 

FIGURE 5. The parameters used to fit the exponential distribution 
vs. the cluster population. This linear trend further reinforces our 
contention that the CDF representing the number of times a 
transcription factor is present amongst a set of genes is governed 
only by the length of sequence analyzed. Note that in both cases, 
the parameters show a good linear fit. This suggests that 
phylogenetic footprinting in the rat/human case has not selected for 
sequences in the promoter region that show a greater number of 
correct promoter regions. 

A cutoff of 95% was set to determine which transcription 
factors ought to be examined. In the case where the 
mouse/rat promoter region was analyzed, it was found that 
there were one or more transcription factors that were 
present on average in 99.7% of the genes in each of the 
clusters. In the case where the human/rat promoter region 
was analyzed, the most conserved transcription factor was 
present in only 85.2% of the genes of a given cluster. 
From this immediate result, it would seem that there is a 
sizable gap in the ability of the phylogenetic analysis 
conducted via CORG in the rat/human case to obtain 
transcription factors that are likely candidates for the 
co-regulation of the genes in the cluster. The 
transcription factors which were highly conserved in both 
cases are given in Tables 2 and 3, with Table 2 utilizing 
a lower cutoff of 80% of the genes for rat/human and 
Table 3 utilizing a cutoff of 95% for rat/mouse. Different 
cutoffs were set because in the rat/mouse case there was a 
transcription factor present in 99.7% of the genes per 
cluster, whilst in the rat/human case the most conserved 
transcription factors were present only 85.2% of the time. 
Further investigation of the parameters that were used for 
the exponential fits suggested that the means and the 
standard deviations in the human/rat case were roughly 
half those in the mouse/rat case. What makes this 
association more interesting is the fact that after 
phylogenetic footprinting through CORG, the sequence being 
analyzed by position weight matrices has decreased roughly 
by half. This suggests that the hit rate of the 
transcription factors is sequence independent, and that 
the two results, despite having very different cutoffs, 
have the same overall characteristic. 

TABLE 2. TRANSCRIPTION FACTORS CONSERVED MORE THAN 80% OF THE TIME BETWEEN HUMAN AND RAT. 

Cluster    Transcription factors 
1          STAT6 
3          STAT6 
4          STAT6 
5          STAT6 
7          STAT6 
9          STAT6 
10         TEF-1 
12         STAT6 

Two further entries, STAT5 and TEF-1, appear in a second column of 
the table. 

Note that 4 of the clusters (2, 6, 8, 11) do not contain highly 
conserved transcription factors and that all of the transcription 
factors are those that are highly represented in the genome. 
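
The statement that the human/rat parameters are roughly half the mouse/rat parameters can be checked directly against the values in Table 1. The sketch below is only a convenience for the reader: the variable names are ours, and the per-cluster parameter is taken to be the average of the mean and the standard deviation, as described for the exponential fits.

```python
# (mean, standard deviation) per cluster, transcribed from Table 1.
MOUSE_RAT = [(7.26, 6.44), (10.95, 10.80), (6.78, 6.02), (9.65, 9.57),
             (5.51, 4.60), (11.01, 11.68), (5.81, 4.48), (7.54, 6.69),
             (8.41, 6.58), (5.96, 5.00), (7.56, 6.64), (10.40, 10.10)]
HUMAN_RAT = [(3.44, 3.80), (5.59, 5.74), (3.90, 3.15), (5.68, 5.89),
             (3.21, 2.97), (5.66, 6.69), (3.51, 2.99), (3.41, 3.44),
             (4.33, 4.07), (3.11, 3.15), (3.42, 3.45), (4.63, 5.78)]


def averaged_parameter(mean_count, sd_count):
    """Single fit parameter: midpoint of the mean and standard deviation."""
    return 0.5 * (mean_count + sd_count)


# Ratio of the human/rat parameter to the mouse/rat parameter, per cluster.
ratios = [averaged_parameter(*h) / averaged_parameter(*m)
          for h, m in zip(HUMAN_RAT, MOUSE_RAT)]
mean_ratio = sum(ratios) / len(ratios)
```

Every per-cluster ratio falls between about 0.48 and 0.64, so "roughly half" is a fair summary, with the human/rat values sitting slightly above one half of their mouse/rat counterparts.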

A random analysis was conducted to ascertain the 
significance of these transcription factors. Thirty 
random genes were grouped from the microarray data and 
the same procedure was conducted upon this synthetic 
cluster. What was found was that 3 of the transcription 
factors that were highly conserved in both the rat/human 
case and the rat/mouse case were also found in a 
random sampling of the data. These transcription 
factors are TEF-1, STAT5, and STAT6. Removing 
these transcription factors from consideration, it was 
observed that the rat/human homologous promoter 
case has no transcription factor that is conserved in 
more than 80% of the genes. In fact, there are no 
transcription factors that are conserved in more than 
75% of the genes in any given cluster. In contrast to 
this, in the rat/mouse case, when TEF-1, STAT5 and STAT6 
were removed, 8 out of the 12 clusters still had 
transcription factors that were conserved in more than 
95% of the genes, with the remaining four clusters 
containing transcription factors that were conserved more 
than 90% of the time. The transcription factors that are 
conserved more than 95% of the time and which are not 
STAT6, STAT5, or TEF-1 are highlighted in red in Table 3. 
This suggests that aside from the global non-specific 
activation of transcription, in our specific experimental 
data, phylogenetic analysis in the human/rat case has 
been unable to find a reasonable candidate for 
co-regulation. 
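
The filtering step implied by the random control amounts to a set difference. The following minimal sketch uses the rows for clusters 2, 7 and 9 as they are listed in Table 3; the function name and data layout are assumptions for illustration.

```python
# Factors that were also highly conserved in the random 30-gene cluster.
NONSPECIFIC = {"STAT5", "STAT6", "TEF-1"}

# Highly conserved factors for three clusters, as listed in Table 3.
CONSERVED = {
    2: ["STAT5", "STAT6", "TEF-1", "CDX"],
    7: ["STAT5", "STAT6", "USF1"],
    9: ["STAT5", "STAT6", "TEF-1", "GE II", "CDXA", "AP2 ALPHA",
        "PAX4", "GATA6", "CIZ", "SRY", "USF1"],
}


def cluster_specific(conserved, nonspecific):
    """Drop factors that the random control shows to be non-specific."""
    return {cluster: [tf for tf in tfs if tf not in nonspecific]
            for cluster, tfs in conserved.items()}


specific = cluster_specific(CONSERVED, NONSPECIFIC)
```

What remains after the subtraction is the candidate set of cluster-specific regulators discussed in the rest of the section.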

TABLE 3. TRANSCRIPTION FACTORS CONSERVED MORE THAN 95% OF THE TIME BETWEEN MOUSE AND RAT. 

Cluster    Transcription factors 
1          STAT5, STAT6, TEF-1 
2          STAT5, STAT6, TEF-1, CDX 
3          STAT6, TEF-1, AP2 ALPHA 
4          STAT5 
5          STAT5, STAT6 
6          STAT5, STAT6, TEF-1 
7          STAT5, STAT6, USF1 
8          STAT5, STAT6, TEF-1, GE II, CDX, GATA6 
9          STAT5, STAT6, TEF-1, GE II, CDXA, AP2 ALPHA, PAX4, 
           GATA6, CIZ, SRY, USF1 
10         STAT5, STAT6, GATA6 
11         STAT5, TEF-1, STAT6, CDX, AP2 ALPHA 
12         STAT5, STAT6, TEF-1 

In contrast to the human/rat case, all of the clusters show transcription factors conserved more than 95% of the time as well as transcription 
factors which are highly conserved and not found in a random sampling of genes. 

Given the following facts: the distribution of 
transcription factors amongst genes in a cluster, the 
parameters that fit the distribution, and the fact that 
there are 457 possible transcription factors, one can 
begin to calculate the probability of a given number of 
transcription factors being highly conserved within a 
cluster in the rat/mouse case. This evaluation was not 
conducted in the rat/human case because there did not 
exist a set of transcription factors which can be 
hypothesized to co-regulate the set of genes. Excluding 
the transcription factors STAT5, STAT6 and TEF-1 and 
assuming a conservation rate of greater than 95%, one 
has a 4% chance of finding a transcription factor. This 
is consistent across clusters due to the linear 
relationship between the cluster size and the mean 
values. Given 457 possible transcription factors, this 
would lead to an expected value of 18. Therefore, in a 
random case one would expect 18 transcription factors to 
be conserved over 95% of the time. However, what is found 
is that there are between 1 and 8 transcription factors 
being highly conserved. This result suggests that solely 
by looking at the genes that are clustered with a very 
high correlation, it is possible to throw out a 
significant number of transcription factors which may be 
indicative of false positives. The associated p-value 
assuming a binomial distribution in this case ranges from 
1.58 × 10^-7 to 5.27 × 10^-3. 
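
The expected count and the p-value range can be reproduced from Eq. (2) with a few lines of code. This is a sketch under one assumption that the text does not state outright: that the p-value is the lower-tail probability of seeing at most n conserved factors out of N = 457 when each is conserved with probability p = 0.04. The function names are ours.

```python
from math import comb


def binomial_pmf(n, total, p):
    """Eq. (2): probability of exactly n successes in `total` trials."""
    return comb(total, n) * p ** n * (1 - p) ** (total - n)


def lower_tail(n_max, total, p):
    """Probability of observing n_max or fewer successes."""
    return sum(binomial_pmf(n, total, p) for n in range(n_max + 1))


N_FACTORS = 457      # possible transcription factors
P_CONSERVED = 0.04   # chance of >95% conservation under the random model

expected = N_FACTORS * P_CONSERVED          # about 18 factors
p_one = lower_tail(1, N_FACTORS, P_CONSERVED)
p_eight = lower_tail(8, N_FACTORS, P_CONSERVED)
```

Under that assumption, the tail probabilities for one and eight conserved factors come out at approximately 1.58 × 10^-7 and 5.27 × 10^-3, which agrees with the range quoted above.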



DISCUSSION 

The main point of phylogenetic analysis has been 
the reduction of false positives in transcription factor 
binding predictions. However, it is our hypothesis that 
one cannot perform such reduction if the result of the 
operation cannot satisfy the notion that co-expression 
implies co-regulation. We believe that, by performing 
phylogenetic analysis between human and rat as well 
as utilizing mouse and rat to extract a homologue for 
the rat promoter region, we have shown that 
phylogenetic footprinting does a poor job of keeping the 
necessary transcription factors that would co-regulate 
clusters of co-expressed genes. Therefore, it is our 
contention that phylogenetic footprinting utilizing 
sequence information only may not be the best way to 
tackle the issue of false positives. One may argue that 
the notion of requiring that all of the genes in the 
highly correlated clusters must have a set of common 
active regulators is a naive approach. In spite of the 
simplicity of this approach, the proposed method was 
still able to find a small subset of transcription 
factors that were highly conserved across all of the 
genes in a given cluster. 

Our second contention is that performing phylogenetic 
footprinting does not yield results that are 
characteristically different from the case where 
phylogenetic footprinting was not performed. In both 
cases, there was an observed exponential distribution 
with parameters that vary with the total number of base 
pairs analyzed. We had expected that, while there were 
numerous false positives generated via standard 
transcription factor binding site prediction, 
transcription factor binding sites would be more 
prevalent in "true" regulatory regions that were 
conserved through evolution than at the baseline rate. 
However, we did not find transcription factor binding 
sites to be localized to regions of evolutionary 
conservation any more than to non-evolutionarily 
conserved segments of the 5' region. So while there was a 
difference in the parameters for the rat/human case vs. 
the rat/mouse case, it was not due specifically to the 
presence of certain conserved regions that were present 
in the different species, but rather due only to the 
length of the sequence being analyzed. Had phylogenetic 
footprinting captured a true species dependent 
conservation, then the parameters which fit the curves in 
Figs. 3 and 4 ought not to be so closely correlated with 
the length of the promoter sequence analyzed. 

This leads to the hypothesis that the primary driving 
force in the number of times a given transcription 
factor occurs within a gene cluster is the length of the 
promoter region analyzed. Furthermore, the general fit 
of the exponential distribution in both cases suggests 
that phylogenetic footprinting does not add information 
to the system. If phylogenetic footprinting in its 
current formulation is correct, then it should be able 
to extract a set of regulatory hotspots in which 
transcription factor binding sites are over-represented. 
If this were the case, then there would not be a 
correlation between the parameters for the exponential 
distribution and the length of the promoter regions 
being analyzed. However, this was found not to be the 
case. There was no greater probability of finding 
transcription factors in the regions conserved via 
phylogenetic footprinting. 

While it has been shown that the probability of a 
transcription factor "hit" is dependent upon the length 
of the sequence analyzed, the number of transcription 
factors that are actually conserved over a high number 
of genes cannot be accounted for in this way. If the 
probability for a transcription factor to be highly 
conserved in >95% of the genes per cluster is around 4%, 
one would expect around 18 transcription factors to show 
similar conservation. However, we find that this is not 
the case. The number of transcription factors that are 
highly conserved in the rat/mouse case ranges from 1 to 8 
after the highly non-specific transcription factors have 
been eliminated. This is due primarily to the fact that 
while the exponential distribution is a reasonable fit 
for the data, the tail end of the distribution, i.e. the 
highly conserved transcription factors, deviates from the 
exponential distribution. 

What is evident in Fig. 6 is that the graph is bimodal, 
with a linear regime that describes the random 
occurrences of transcription factors, and a nonlinear 
regime in which the transcription factors show a 
non-random occurrence rate. This suggests that grouping 
the genes by their expression profiles allows one to 
isolate a set of transcription factors that have a high 
probability of being active under the experimental 
regime. Taking into account the random trial, one can 
further cut down on the number of isolated 
transcription factors by removing the non-specific 
initiators of transcription, i.e. those that are part of 
widespread signaling cascades. 

In Fig. 7, we illustrate what we term the "parameter 
gap". In all of the cases shown in Fig. 6, we were able 
to get a better fit in terms of the R^2 value if we 
fitted only the transcription factors that were 
conserved over a few genes. This "parameter gap" allows 
for the determination of both the limits of the expected 
number of transcription factor binding sites in a given 
cluster and the limits of the conservation rate. In this 
case, the bounds for the expected number of 
transcription factors are 0-18, and the bounds for the 
conservation rate are 80-100%. This allows us to 
discount the human/rat phylogenetic case as isolating 
any transcription factor binding sites that co-regulate 
a cluster, and allows us to calculate the p-values for 
the number of transcription factor binding sites in a 
cluster. 
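
The comparison behind the "parameter gap" can be sketched as two least-squares fits on the log-linearized counts: one restricted to the first few points and one over all of the data. The code below is a minimal illustration with synthetic counts; the function name, the ordinary least-squares fit, and the synthetic data are assumptions, not the fitting procedure used for Figs. 6 and 7.

```python
import math


def fit_log_linear(counts, n_points=None):
    """Least-squares line through ln(count) vs. number of genes k.

    counts[k - 1] is the number of factors present in exactly k genes.
    Returns (slope, intercept, r_squared); zero counts are skipped.
    """
    xs, ys = [], []
    for k, c in enumerate(counts, start=1):
        if n_points is not None and k > n_points:
            break
        if c > 0:
            xs.append(k)
            ys.append(math.log(c))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Synthetic example: an exponential head followed by an inflated tail.
head = [100 * math.exp(-0.3 * k) for k in range(1, 11)]
counts = head + [20, 25, 30]
slope_head, _, r2_head = fit_log_linear(counts, n_points=10)
slope_all, _, r2_all = fit_log_linear(counts)
```

In this synthetic case the head-only fit recovers the decay rate, while the fit over all points is pulled toward the tail and has a lower R^2, which mirrors the gap between the two fitted curves.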

While the presence of STAT6 and STAT5 is highly 
non-specific, we feel that the results are still rather 
interesting. Given the relative promiscuity of these 
transcription factors for genes, we believe that the 
presence of STAT5 and STAT6 shows the relatively 
widespread effects of the various JAK-STAT pathways that 
are activated by cytokines and growth factors. 15 Further 
examination of their binding matrices in TRANSFAC 
shows that STAT5 and STAT6 are highly nonspecific, with 
base specificity in 3/8 and 4/8 of the binding matrix, 
leading to a high rate of positive hits in the promoter 
region. We hypothesize that the relative promiscuity of 
the STAT5 and STAT6 transcription factors makes them 
possible candidates as primary initiators of 
transcription, and that it is the other cluster specific 
transcription factor binding sites that serve to control 
the relative shapes of the expression profiles. 

FIGURE 6. Log normalized version of Fig. 3. The lines are fits obtained by fitting those transcription factors that are not highly 
conserved within a cluster. What is evident is that there are a number of transcription factors at the tail region that cannot be 
adequately modeled by the exponential distribution, suggesting a non-random preference for a given cluster of genes. 

FIGURE 7. Parameter gap. This illustrates the gap caused by 
fitting all of the data versus fitting only the first ten data 
points. The number of informative transcription factors should be 
less than the expected value if the exponential distribution is 
estimated from all of the data, and greater than the expected value 
if the exponential distribution accounts only for the first ten 
points. The R^2 value is 0.74 with the first 10 data points and 0.69 
when all of the data were used. 

However, for many of the other transcription factors, 
such as CDX (Caudal-type homeodomain protein), 
AP2-Alpha (activating enhancer binding protein 2), 
USF (Upstream Stimulating Factor), GATA6 
(GATA Binding Protein 6), and PAX4 (Paired Box 
Gene 4), their presence within the various clusters is 
more specific. Utilizing information from iHOP, 18 it is 
possible to establish links between them and the effects 
of corticosteroid administration. For the transcription 
factors GE II, SRY and CIZ, there was not sufficient 
information about their functions in the context of 
corticosteroid response to make a meaningful 
evaluation. 

For these five transcription factors, the RG-U34A 
microarray had expression data on four of them 
(CDX, USF1, AP2-Alpha and PAX4). Of these, CDX 
and USF were down-regulated after the administration 
of corticosteroids, while the rest of the transcription 
factors were up-regulated. The down-regulation of CDX 
may be evidence of the suppression of proliferation by 
corticosteroids. CDX has been characterized as a 
regulator of cancer cell proliferation and is often 
up-regulated in malignant tumors. 17 Therefore, it would 
follow that the down-regulation of this transcription 
factor would lead to the suppression of cellular 
proliferation, one of the hallmarks of malignant tumors. 
The down-regulation of USF is characteristic of the 
decrease in lipid and glucose metabolism by the liver, 23 
which leads to an increase in the levels of circulating 
free fatty acids and glucose in the bloodstream and, in 
turn, to the associated steroid induced diabetes. 

The up-regulation of AP2-Alpha could again be
evidence of the suppression of cellular proliferation by
corticosteroids. AP2-Alpha has been cited as a tumor
suppressor, 38 and combined with the down-regulation
of CDX, it may point to the mechanism by which
corticosteroids suppress cellular proliferation. PAX4
is normally associated with the differentiation of beta
islet cells in the pancreas. This is consistent with the
observation that the levels of circulating glucose are
increased by the administration of corticosteroids.
However, while its presence in the liver has not been
substantiated in the literature, it is conceivable that,
given that it is active in one organ under administration
of corticosteroids, it could play a less visible though
still important role in the liver as well. An interesting
question that arises from this observation is whether
the differentiation of beta islet cells in the pancreas
is driven primarily by the levels of circulating glucose
or by the levels of corticosteroid.

We acknowledge that there is significant disagreement
between the results that we have obtained and
those obtained via phylogenetic analysis. However, we
feel that our results are correct, given the method's
success at identifying possible co-regulators. The fact
that the algorithm has identified a very small subset of
transcription factors that show significant biological
roles related to the pharmacological effect of
corticosteroids leads us to believe that the algorithm has
been successful in predicting transcription factor/gene
relations. Such data will allow us to build regulatory
networks that can be used to build PK/PD models, which
in turn will allow us to predict the behavior of the
system under different dosing conditions.

If the disagreement between the results obtained from
the presented method and phylogenetic footprinting is a
paradox rather than a contradiction, an interesting
possibility arises. Currently, there is a similar and
perhaps related paradox in the field of transcriptional
network analysis. It has been widely noted that maps of
transcriptional interactions appear to have a scale-free
topology in which the distribution of links between
different genes follows an exponential
distribution. 20,25,35 However, it has also been observed
that, despite the apparent scale-free nature of the
network, biological transcription networks exhibit a
higher degree of robustness than could normally be
explained by a scale-free network. 6 Specifically, the
removal of a large number of hubs is not lethal to an
organism. It has been shown that in yeast, the removal of
28 out of 33 highly connected hubs did not lead to the
death of the yeast cells, 6 with little correlation between
the connectivity of a node and its importance to viability.
Additionally, simulations which explore the evolution
of metabolic networks have resulted in networks that
contain hubs but do not exhibit a clear power law 32 in
their network connectivity, suggesting that there are
non-scale-free elements in the overall network.

Both the analysis of the random clusters of genes and
the transcriptional networks obtained via phylogenetic
analysis seem to confirm the existence of a scale-free
network, as evidenced by the exponential distribution of
links between transcription factors and a set of genes.
Such an observation can be justified by the fact that
some transcription factors appear to be highly selective
while others, such as STAT5 and STAT6, appear to be
highly promiscuous. However, as shown in Fig. 6, there
also appears to be a significantly non-exponential
portion to the distribution. This suggests that the genes
in a cluster and their respective co-regulators may not
follow a scale-free network. Our hypothesis is that while
the overall topology of a network is scale-free, if one
were to look at important response pathways, one may
obtain a sub-graph with a different topology given the
need to maintain a high degree of robustness. This means
that there are certain co-expressed genes for which the
pathway is common across multiple organisms and which
have evolutionarily conserved transcription factor
binding sites. However, there are also genes that augment
this central pathway and contain regulatory regions that
may be species specific.

We find this notion attractive given the fact that it 
has been shown that rats and humans often have dif- 
ferent responses to medication or treatment regimens 12 
despite the fact that the same primary pathway is being 
targeted. Given the relative importance of the highly 
connected hubs in many different biological processes, 
these auxiliary genes would allow the system to 
maintain consistency within the primary response 
pathway in the presence of significant cross-talk be- 
tween different signaling pathways as well as pertur- 
bations such as disease or injury. 

If this were the case, then it would allow us to
reconcile our results with those obtained through
standard phylogenetic footprinting. The network
corresponding to the links extracted via phylogenetic
analysis may correspond to the primary response
pathways, i.e. those that code for the enzymatic products
that the organism uses to deal with alterations to
homeostasis and that are evolutionarily conserved. The
results obtained via our algorithm include this primary
response pathway as well as the extra links that give the
network a composite characteristic rather than a simple
scale-free architecture.

Assuming that this hypothesis is correct, the
following questions arise: What are the properties of
this network? Can we find this sub-network efficiently
given those properties? If we identify this network, can
we show that the genes that make up the nodes of the
network are co-expressed? If these questions can be
answered in the affirmative, then it would give
molecular biologists a powerful tool for the
identification of key pathways. Currently, we can only
provide a vague notion as to what the property of this
transcriptional sub-network would be, namely that it
should be robust to the removal of highly connected
hubs, i.e. the removal of a hub would not separate the
network into two disjoint subsets, thus rendering it
non-functional.
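
One way to make this robustness property testable is to remove
each hub in turn and check whether the remaining network is still
connected. The following is a minimal sketch of that check, not
the procedure used in this work. It assumes an undirected network
stored as a dictionary of neighbor sets and defines a hub by a
hypothetical degree threshold (min_degree); both choices are
assumptions made for illustration.

```python
from collections import deque


def is_connected(adjacency, removed=frozenset()):
    """Return True if the nodes outside `removed` form a single component."""
    nodes = [n for n in adjacency if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:  # breadth-first search that skips removed nodes
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def fragile_hubs(adjacency, min_degree):
    """List hubs (degree >= min_degree) whose removal disconnects the network."""
    hubs = [n for n, neighbors in adjacency.items() if len(neighbors) >= min_degree]
    return [hub for hub in hubs if not is_connected(adjacency, {hub})]


# A star network depends entirely on its hub; a fully connected one does not.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
clique = {n: {m for m in "abcd" if m != n} for n in "abcd"}
print(fragile_hubs(star, 3), fragile_hubs(clique, 3))
```

A sub-network that satisfies the property described above would
return an empty list. Removing one hub at a time is the simplest
reading of the criterion; removing several hubs together would
need a combinatorial extension of the same check.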



CONCLUSION/FUTURE WORK 

The primary goal behind the prediction of transcription
factor binding sites is the creation of a global
gene interaction network that can be used to predict an
organism's response to different stimuli. Therefore, it is
our contention that any such network must be
coherent with experimental results. Our initial analysis
of the transcription factor binding sites predicted via
phylogenetic footprinting suggests that oftentimes this
is not the case, and that there were many genes that
were co-expressed but did not appear to be co-regulated
under the experimental regime. While it is plausible
and highly likely that unrelated regulatory factors
can lead to co-expression, it is our belief that the bulk
of co-expressed genes ought to have similar regulatory
mechanisms, given prior work by Wolfe et al. 40
Therefore, we focused on whether it was possible to
predict a set of transcription factors that would give
rise to the observed co-expression. Our method rests
primarily upon the notion that instead of eliminating
false positives by comparing the predictions between
different organisms, we ought to be able to eliminate
false positives by comparing predictions between
different genes which show the same response.

Ideally, the results of the two methods
should agree to a large extent. However, the results
which we obtained were different from those obtained
through phylogenetic analysis, which leads to the
following possibilities: either there is a paradox and
both methods give correct answers, only one of the
methods is correct, or neither of the methods is correct.
Of these possibilities, we found the first the most
intriguing because, if one assumes the correctness of
both, it provides a mechanism for the possible
elucidation of primary response pathways in a highly
connected network structure, an explanation for the
phenomenon of differing side effects in different
organisms, and a resolution of the paradox of a highly
robust scale-free network.

Additionally, we believe that we have found only
the transcription factors that are active under a given
condition, which is not the overall set of transcription
factors. We believe that with additional experiments on
the response of an organism under different conditions,
it would be possible for us to isolate the sets of
transcription factors that are active under those
conditions and thereby obtain a clearer picture of the
overall regulatory structure of the organism as a whole.
Therefore, an iterative process in which every new
experiment yields a few transcription factors can
eventually lead to a more complete picture of the
regulatory structure of the network.
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APPENDIX 1 



1. The gene expression profile is normalized to
N(0,1) via the z-score transform.

2. If the sequences are longer than 10 time points,
piecewise averaging is conducted, i.e. sets of n
consecutive time points are averaged together to reduce
the exponential expansion of the search space. In the
case of our data, the 17 time points are interpolated to
18 time points, and the time series are broken down into
sets of 2 to be piecewise averaged.

3. These piecewise averaged points are then converted
into symbols through the use of Gaussian breakpoints.
Gaussian breakpoints are divisions in the Gaussian
distribution such that the cumulative probability of
each section is equal. These can be obtained through the
use of CDF tables found in statistics textbooks or by
solving the following equation for $b_i$:

$$\frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{b_i}{\sqrt{2}}\right)\right] = \frac{i}{k+1}$$

$i = 1, \ldots, k$; $k$ = number of breakpoints;
$b_i$ = breakpoint value

The overall process of assigning a letter to each
piecewise averaged point is illustrated below:

[Figure: an expression profile shown as raw, normalized, and
piecewise averaged series, plotted against time (arbitrary units).]



4. After the symbolic transformation, the series of
symbols is converted into a single integer via the
formula:

$$\mathrm{hash}(c, w, a) = 1 + \sum_{j=1}^{w} \left[\mathrm{ord}(c_j) - 1\right] \times a^{j-1}$$

where $c_j$ is the letter assigned to the $j$-th piecewise
averaged point, $a$ is the size of the alphabet, 27 and $w$ is
the total length of the expression profile divided by the
number of points per piecewise average. 31 The
parameters of the alphabet were selected so that the
population distribution of motifs is non-exponential,
to reflect the non-random distribution of expression
profiles present in the data; $w$ was chosen to preserve
as much of the high-frequency component of the signal
as possible.
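
The four steps above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions rather
than the implementation used in this work: the function names are
hypothetical, the default alphabet size of 3 is an arbitrary
example value, breakpoints are taken from the inverse normal CDF
so that an alphabet of a letters uses a - 1 breakpoints, and the
profile length is assumed to be already divisible by the number
of points per average (the interpolation from 17 to 18 time
points is not shown).

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def z_normalize(profile):
    """Step 1: rescale a profile to zero mean and unit variance."""
    mu, sigma = fmean(profile), pstdev(profile)
    if sigma == 0:
        return [0.0] * len(profile)  # flat profile carries no shape information
    return [(x - mu) / sigma for x in profile]


def piecewise_average(values, points_per_segment=2):
    """Step 2: average consecutive, non-overlapping groups of points."""
    if len(values) % points_per_segment:
        raise ValueError("length must be a multiple of points_per_segment")
    return [fmean(values[i:i + points_per_segment])
            for i in range(0, len(values), points_per_segment)]


def gaussian_breakpoints(alphabet_size):
    """Step 3: cut N(0,1) into alphabet_size sections of equal probability."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(i / alphabet_size)
            for i in range(1, alphabet_size)]


def to_ordinals(averaged, breakpoints):
    """Step 3: map each averaged point to a letter ordinal in 1..a."""
    return [bisect_right(breakpoints, value) + 1 for value in averaged]


def sax_hash(ordinals, alphabet_size):
    """Step 4: 1 + sum of (ord(c_j) - 1) * a**(j - 1) for j = 1..w."""
    return 1 + sum((o - 1) * alphabet_size ** j for j, o in enumerate(ordinals))


def profile_hash(profile, alphabet_size=3, points_per_segment=2):
    """Run steps 1 to 4; alphabet_size=3 is an illustrative default only."""
    averaged = piecewise_average(z_normalize(profile), points_per_segment)
    ordinals = to_ordinals(averaged, gaussian_breakpoints(alphabet_size))
    return sax_hash(ordinals, alphabet_size)


# Profiles with the same shape but different scale receive the same hash.
print(profile_hash([0, 0, 1, 1]), profile_hash([10, 10, 30, 30]))
```

Because the hash enumerates every possible word of length w over
an alphabet of size a, its values run from 1 to a**w, which is
why piecewise averaging is needed to keep the search space small.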



APPENDIX 2 



(i) $k = 0$, $S(k) = \emptyset$, $D(k) = -\infty$, $\max = -\infty$

(ii) $k = k + 1$

(iii) $h^{*} = \arg\max_{h} N(h)$, $N(h)$ = number of genes with corresponding
hash value $h$

(iv) $G(k) = \{g_i : \mathrm{hash}(g_i) = h^{*}\}$, the subset of genes that hash to $h^{*}$

(v) Evaluate $F[Y_{g_i}(t)]$; $t = 0, \ldots, T$; for each gene $g_i$

(vi) Evaluate $D(k) = \max\,\max_{g_i \in \Sigma} \left| F[Y_{g_i}(t)] - F[Y_{g_l}(t)] \right|$

(vii) If $D(k) > \max$

(viii) $\max = D(k)$; $F = k$

(ix) Go to (ii) until all peaks have been added

(x) For $a = 1$ to $F$

(xi) Select $\Sigma = S(a - 1) \cup G(a)$
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